AITopics | survey data

Collaborating Authors

survey data

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Consistency of the $k$-Nearest Neighbor Regressor under Complex Survey Designs

Hasler, Caren

arXiv.org Machine LearningMar-19-2026

We study the consistency of the $k$-nearest neighbor regressor under complex survey designs. While consistency results for this algorithm are well established for independent and identically distributed data, corresponding results for complex survey data are lacking. We show that the $k$-nearest neighbor regressor is consistent under regularity conditions on the sampling design and the distribution of the data. We derive lower bounds for the rate of convergence and show that these bounds exhibit the curse of dimensionality, as in the independent and identically distributed setting. Empirical studies based on simulated and real data illustrate our theoretical findings.

artificial intelligence, bmn, machine learning, (18 more...)

arXiv.org Machine Learning

2603.17551

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Texas (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.88)

Add feedback

Distributional Random Forests for Complex Survey Designs on Reproducing Kernel Hilbert Spaces

Zou, Yating, Matabuena, Marcos, Kosorok, Michael R.

arXiv.org Machine LearningDec-10-2025

We study estimation of the conditional law $P(Y|X=\mathbf{x})$ and continuous functionals $Ψ(P(Y|X=\mathbf{x}))$ when $Y$ takes values in a locally compact Polish space, $X \in \mathbb{R}^p$, and the observations arise from a complex survey design. We propose a survey-calibrated distributional random forest (SDRF) that incorporates complex-design features via a pseudo-population bootstrap, PSU-level honesty, and a Maximum Mean Discrepancy (MMD) split criterion computed from kernel mean embeddings of Hájek-type (design-weighted) node distributions. We provide a framework for analyzing forest-style estimators under survey designs; establish design consistency for the finite-population target and model consistency for the super-population target under explicit conditions on the design, kernel, resampling multipliers, and tree partitions. As far as we are aware, these are the first results on model-free estimation of conditional distributions under survey designs. Simulations under a stratified two-stage cluster design provide finite sample performance and demonstrate the statistical error price of ignoring the survey design. The broad applicability of SDRF is demonstrated using NHANES: We estimate the tolerance regions of the conditional joint distribution of two diabetes biomarkers, illustrating how distributional heterogeneity can support subgroup-specific risk profiling for diabetes mellitus in the U.S. population.

conditional distribution, finite population, random forest, (17 more...)

arXiv.org Machine Learning

2512.08179

Country:

North America > United States (0.48)
Europe > Austria > Vienna (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report > Experimental Study (0.93)

Industry: Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Add feedback

Explainable Federated Learning for U.S. State-Level Financial Distress Modeling

Carta, Lorenzo, Spadea, Fernando, Seneviratne, Oshani

arXiv.org Artificial IntelligenceNov-13-2025

We present the first application of federated learning (FL) to the U.S. National Financial Capability Study, introducing an interpretable framework for predicting consumer financial distress across all 50 states and the District of Columbia without centralizing sensitive data. Our cross-silo FL setup treats each state as a distinct data silo, simulating real-world governance in nationwide financial systems. Unlike prior work, our approach integrates two complementary explainable AI techniques to identify both global (nationwide) and local (state-specific) predictors of financial hardship, such as contact from debt collection agencies. We develop a machine learning model specifically suited for highly categorical, imbalanced survey data. This work delivers a scalable, regulation-compliant blueprint for early warning systems in finance, demonstrating how FL can power socially responsible AI applications in consumer credit risk and financial inclusion.

artificial intelligence, machine learning, prediction, (18 more...)

arXiv.org Artificial Intelligence

2511.08588

Country:

North America > United States > District of Columbia (0.25)
North America > United States > New York (0.15)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Banking & Finance > Economy (0.88)
Banking & Finance > Credit (0.67)
Government > Regional Government > North America Government > United States Government (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.75)

Add feedback

InsurAgent: A Large Language Model-Empowered Agent for Simulating Individual Behavior in Purchasing Flood Insurance

Geng, Ziheng, Liu, Jiachen, Cao, Ran, Cheng, Lu, Frangopol, Dan M., Cheng, Minghui

arXiv.org Artificial IntelligenceNov-5-2025

Flood insurance is an effective strategy for individuals to mitigate disaster-related losses. However, participation rates among at-risk populations in the United States remain strikingly low. This gap underscores the need to understand and model the behavioral mechanisms underlying insurance decisions. Large language models (LLMs) have recently exhibited human-like intelligence across wide-ranging tasks, offering promising tools for simulating human decision-making. This study constructs a benchmark dataset to capture insurance purchase probabilities across factors. Using this dataset, the capacity of LLMs is evaluated: while LLMs exhibit a qualitative understanding of factors, they fall short in estimating quantitative probabilities. To address this limitation, InsurAgent, an LLM-empowered agent comprising five modules including perception, retrieval, reasoning, action, and memory, is proposed. The retrieval module leverages retrieval-augmented generation (RAG) to ground decisions in empirical survey data, achieving accurate estimation of marginal and bivariate probabilities. The reasoning module leverages LLM common sense to extrapolate beyond survey data, capturing contextual information that is intractable for traditional models. The memory module supports the simulation of temporal decision evolutions, illustrated through a roller coaster life trajectory. Overall, InsurAgent provides a valuable tool for behavioral modeling and policy analysis.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.02119

Country: North America > United States (1.00)

Genre: Research Report > New Finding (1.00)

Industry:

Banking & Finance > Insurance (1.00)
Government > Regional Government > North America Government > United States Government (0.68)
Education > Educational Setting > K-12 Education (0.46)
Education > Educational Setting > Higher Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Target Population Synthesis using CT-GAN

Rastogi, Tanay, Jonsson, Daniel

arXiv.org Artificial IntelligenceOct-2-2025

Agent-based models used in scenario planning for transportation and urban planning usually require detailed population information from the base as well as target scenarios. These populations are usually provided by synthesizing fake agents through deterministic population synthesis methods. However, these deterministic population synthesis methods face several challenges, such as handling high-dimensional data, scalability, and zero-cell issues, particularly when generating populations for target scenarios. This research looks into how a deep generative model called Conditional Tabular Generative Adversarial Network (CT-GAN) can be used to create target populations either directly from a collection of marginal constraints or through a hybrid method that combines CT-GAN with Fitness-based Synthesis Combinatorial Optimization (FBS-CO). The research evaluates the proposed population synthesis models against travel survey and zonal-level aggregated population data. Results indicate that the stand-alone CT-GAN model performs the best when compared with FBS-CO and the hybrid model. CT-GAN by itself can create realistic-looking groups that match single-variable distributions, but it struggles to maintain relationships between multiple variables. However, the hybrid model demonstrates improved performance compared to FBS-CO by leveraging CT-GAN ability to generate a descriptive base population, which is then refined using FBS-CO to align with target-year marginals. This study demonstrates that CT-GAN represents an effective methodology for target populations and highlights how deep generative models can be successfully integrated with conventional synthesis techniques to enhance their performance.

artificial intelligence, machine learning, target population, (19 more...)

arXiv.org Artificial Intelligence

2510.00871

Country: Europe > Sweden (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.45)

Add feedback

Dimension Agnostic Testing of Survey Data Credibility through the Lens of Regression

Basu, Debabrota, Chakraborty, Sourav, Chanda, Debarshi, Das, Buddha Dev, Ghosh, Arijit, Ray, Arnab

arXiv.org Machine LearningAug-29-2025

Assessing whether a sample survey credibly represents the population is a critical question for ensuring the validity of downstream research. Generally, this problem reduces to estimating the distance between two high-dimensional distributions, which typically requires a number of samples that grows exponentially with the dimension. However, depending on the model used for data analysis, the conclusions drawn from the data may remain consistent across different underlying distributions. In this context, we propose a task-based approach to assess the credibility of sampled surveys. Specifically, we introduce a model-specific distance metric to quantify this notion of credibility. We also design an algorithm to verify the credibility of survey data in the context of regression models. Notably, the sample complexity of our algorithm is independent of the data dimension. This efficiency stems from the fact that the algorithm focuses on verifying the credibility of the survey data rather than reconstructing the underlying regression model. Furthermore, we show that if one attempts to verify credibility by reconstructing the regression model, the sample complexity scales linearly with the dimensionality of the data. We prove the theoretical correctness of our algorithm and numerically demonstrate our algorithm's performance.

artificial intelligence, machine learning, regression model, (16 more...)

arXiv.org Machine Learning

2508.20616

Country:

North America > United States (0.93)
Europe (0.92)

Genre: Research Report (1.00)

Industry:

Government (0.92)
Energy > Oil & Gas (0.34)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.92)

Add feedback

Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare

Khaokaew, Yonchanok, Salim, Flora D., Züfle, Andreas, Xue, Hao, Anderson, Taylor, MacIntyre, C. Raina, Scotch, Matthew, Heslop, David J

arXiv.org Artificial IntelligenceApr-18-2025

Generative agents have been increasingly used to simulate human behaviour in silico, driven by large language models (LLMs). These simulacra serve as sandboxes for studying human behaviour without compromising privacy or safety. However, it remains unclear whether such agents can truly represent real individuals. This work compares survey data from the Understanding America Study (UAS) on healthcare decision-making with simulated responses from generative agents. Using demographic-based prompt engineering, we create digital twins of survey respondents and analyse how well different LLMs reproduce real-world behaviours. Our findings show that some LLMs fail to reflect realistic decision-making, such as predicting universal vaccine acceptance. However, Llama 3 captures variations across race and Income more accurately but also introduces biases not present in the UAS data. This study highlights the potential of generative agents for behavioural research while underscoring the risks of bias from both LLMs and prompting strategies.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2504.0826

Country: North America > United States (1.00)

Genre:

Research Report > New Finding (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Health & Medicine > Therapeutic Area > Vaccines (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

From Perceptions to Decisions: Wildfire Evacuation Decision Prediction with Behavioral Theory-informed LLMs

Chen, Ruxiao, Wang, Chenguang, Sun, Yuran, Zhao, Xilei, Xu, Susu

arXiv.org Artificial IntelligenceFeb-24-2025

Evacuation decision prediction is critical for efficient and effective wildfire response by helping emergency management anticipate traffic congestion and bottlenecks, allocate resources, and minimize negative impacts. Traditional statistical methods for evacuation decision prediction fail to capture the complex and diverse behavioral logic of different individuals. In this work, for the first time, we introduce FLARE, short for facilitating LLM for advanced reasoning on wildfire evacuation decision prediction, a Large Language Model (LLM)-based framework that integrates behavioral theories and models to streamline the Chain-of-Thought (CoT) reasoning and subsequently integrate with memory-based Reinforcement Learning (RL) module to provide accurate evacuation decision prediction and understanding. Our proposed method addresses the limitations of using existing LLMs for evacuation behavioral predictions, such as limited survey data, mismatching with behavioral theory, conflicting individual preferences, implicit and complex mental states, and intractable mental state-behavior mapping. Experiments on three post-wildfire survey datasets show an average of 20.47% performance improvement over traditional theory-informed behavioral models, with strong cross-event generalizability. Our complete code is publicly available at https://github.com/SusuXu-s-Lab/FLARE

evacuate 0, evacuation decision, prediction, (11 more...)

arXiv.org Artificial Intelligence

2502.17701

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
North America > United States > Colorado > Boulder County (0.04)
North America > United States > California > Sonoma County (0.04)
(5 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.46)

Industry:

Law Enforcement & Public Safety > Fire & Emergency Services (0.46)
Information Technology (0.30)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions

Suh, Joseph, Jahanparast, Erfan, Moon, Suhong, Kang, Minwoo, Chang, Serina

arXiv.org Artificial IntelligenceFeb-23-2025

Large language models (LLMs) present novel opportunities in public opinion research by predicting survey responses in advance during the early stages of survey design. Prior methods steer LLMs via descriptions of subpopulations as LLMs' input prompt, yet such prompt engineering approaches have struggled to faithfully predict the distribution of survey responses from human subjects. In this work, we propose directly fine-tuning LLMs to predict response distributions by leveraging unique structural characteristics of survey data. To enable fine-tuning, we curate SubPOP, a significantly scaled dataset of 3,362 questions and 70K subpopulation-response pairs from well-established public opinion surveys. We show that fine-tuning on SubPOP greatly improves the match between LLM predictions and human responses across various subpopulations, reducing the LLM-human gap by up to 46% compared to baselines, and achieves strong generalization to unseen surveys and subpopulations. Our findings highlight the potential of survey-based fine-tuning to improve opinion prediction for diverse, real-world subpopulations and therefore enable more efficient survey designs. Our code is available at https://github.com/JosephJeesungSuh/subpop.

arxiv preprint arxiv, language model, subpopulation, (14 more...)

arXiv.org Artificial Intelligence

2502.16761

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > Florida > Miami-Dade County > Miami (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre:

Research Report > New Finding (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Health & Medicine > Therapeutic Area > Immunology (0.46)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Human Preferences in Large Language Model Latent Space: A Technical Analysis on the Reliability of Synthetic Data in Voting Outcome Prediction

Ball, Sarah, Allmendinger, Simeon, Kreuter, Frauke, Kühl, Niklas

arXiv.org Artificial IntelligenceFeb-22-2025

Generative AI (GenAI) is increasingly used in survey contexts to simulate human preferences. While many research endeavors evaluate the quality of synthetic GenAI data by comparing model-generated responses to gold-standard survey results, fundamental questions about the validity and reliability of using LLMs as substitutes for human respondents remain. Our study provides a technical analysis of how demographic attributes and prompt variations influence latent opinion mappings in large language models (LLMs) and evaluates their suitability for survey-based predictions. Using 14 different models, we find that LLM-generated data fails to replicate the variance observed in real-world human responses, particularly across demographic subgroups. In the political space, persona-to-party mappings exhibit limited differentiation, resulting in synthetic data that lacks the nuanced distribution of opinions found in survey data. Moreover, we show that prompt sensitivity can significantly alter outputs for some models, further undermining the stability and predictiveness of LLM-based simulations. As a key contribution, we adapt a probe-based methodology that reveals how LLMs encode political affiliations in their latent space, exposing the systematic distortions introduced by these models. Our findings highlight critical limitations in AI-generated survey data, urging caution in its use for public opinion research, social science experimentation, and computational behavioral modeling.

arxiv preprint arxiv, entropy, prompt sensitivity, (11 more...)

arXiv.org Artificial Intelligence

2502.1628

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > United States > Maryland (0.04)
Europe > Germany > Bavaria > Upper Franconia > Bayreuth (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Government > Voting & Elections (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback